DOCUMENT RESUME 



ED 252 563                                              TM 850 032

AUTHOR          Herman, Joan
TITLE           A Practical Approach to Local Test Development.
                Resource Paper No. 6. Research into Practice
                Project.
INSTITUTION     California Univ., Los Angeles. Center for the Study
                of Evaluation.
SPONS AGENCY    National Inst. of Education (ED), Washington, DC.
PUB DATE        Nov 84
GRANT           NIE-G-84-0112-P4
NOTE            63p.
PUB TYPE        Guides - Non-Classroom Use (055)
EDRS PRICE      MF01/PC03 Plus Postage.
DESCRIPTORS     *Criterion Referenced Tests; Elementary Secondary
                Education; Instructional Development; Models; Rating
                Scales; Student Evaluation; *Teacher Made Tests;
                *Test Construction; Test Format; *Test Items; Test
                Manuals; Test Validity
IDENTIFIERS     Curriculum Related Testing; *Domain Referenced Tests;
                *Test Specifications



ABSTRACT 

This resource paper is a guide for planning and
developing instructionally relevant tests of student learning at the
classroom, building, or district level. It is based on a model of
instruction and testing which systematically uses assessment
information to support and facilitate instructional improvement.
Course goals and objectives are first translated into domain
specifications which are used to link testing and instruction. A
domain specification contains six components: (1) domain description;
(2) content limits; (3) distractor limits and response criteria; (4)
item format; (5) student directions; and (6) sample items. Test items
are then developed to match domain specifications. Standard item
construction rules are given for both constructed responses (essay,
short answer, completion) and selected responses (true-false,
matching, multiple choice). Items are then judged for their match to
the six domain components plus linguistic and thinking complexity
using the Item Rating Scales. When all eight categories have been
individually scored on scales from 0 to 10, an overall rating is
calculated. Guidelines are given for interpreting overall item rating
for acceptance, rejection, or revision. Appendices contain: (1)
sample domain-referenced test specifications; (2) item difficulty
levels; and (3) materials used in the item rating process. (BS)



***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made      *
*   from the original document.                                       *
***********************************************************************


DELIVERABLE - NOVEMBER 1984 
RESEARCH INTO PRACTICE PROJECT 



Joan L. Herman 
Project Director 



U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as received from the person or
organization originating it.

Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not necessarily
represent official NIE position or policy.



Resource Paper No. 6, 1984 
A Practical Approach to Local Test Development



Grant Number 
NIE-G-84-0112-P4




CENTER FOR THE STUDY OF EVALUATION 
Graduate School of Education 
University of California, Los Angeles 




A PRACTICAL APPROACH TO 
LOCAL TEST DEVELOPMENT 



James Burry 
Joan Herman 
Eva L. Baker 



Resource Paper No. 6 
1984 



CENTER FOR THE STUDY OF EVALUATION 
Graduate School of Education 
University of California, Los Angeles



The project presented or reported herein was 
supported pursuant to a grant from the National 
Institute of Education, Department of Education. 
However, the opinions expressed herein do not 
necessarily reflect the position or policy of the 
National Institute of Education and no official 
endorsement by the National Institute of Education 
should be inferred. 






TABLE OF CONTENTS 



INTRODUCTION 

Potential Users of the Guide 
Approach to Testing 
Structure of the Guide 

THE COMPONENTS OF DOMAIN SPECIFICATIONS 

Overview to Components 
Analyzing Intentions & Expectations 
Developing the Domain Specification 
Domain Description 
Content Limits 

Distractor Limits/Response Criteria

Format 

Directions 

Sample Item 

Summary 

Sample Domain Specification 

ITEM CONSTRUCTION RULES 

Constructed Responses 
Selected Responses 

THE ITEM RATING SCALE 

Background 

Using the IRS 

Overall Item Rating 

Interpreting an Item's Overall Rating 

USING THE GUIDE 

CONCLUSION 

REFERENCES 

APPENDIX A: SAMPLE DOMAIN SPECIFICATIONS 
APPENDIX B: DIFFICULTY LEVELS 
APPENDIX C: RATING MATERIALS 



INTRODUCTION 



Potential Users of the Guide 

This resource paper offers a guide for planning and developing
instructionally relevant tests of student learning. The planning and
development approach we describe responds to findings from several years of
CSE research on the uses of tests and the broader evaluation systems of
which they are often a part. The heart of these findings is that school
practitioners need tests which match their curriculum, which are useful
for instructional planning, and which are fair and valid for evaluation.

For example, at the classroom level, teachers rely to a great extent
on the results of tests that they themselves develop, in large part because
these tests are sensitive to their instructional intentions and are viewed
as most appropriate for their students in both content and format. They
want tests that reflect their instruction and that provide information they
can use to monitor student learning (Dorr-Bremme, 1983).

At the building level, principals too want information that can be
used to judge their schools' progress. Like teachers, principals want
tests which match their actual school programs but are hesitant to use the
results of teacher-developed tests (Burry et al., 1982). Perhaps, like some
others, they have reservations about the quality of such tests.

Those at the district level also echo concern for the instructional
relevance of testing programs. Complaints are often made that the
standardized, norm-referenced tests which are frequently administered do
not match up with districts' instructional offerings and intentions
(O'Shea, 1981) and are inappropriate for accountability purposes. In
response, more and more districts are developing their own tests to match
their curricular continuum (Burry et al., 1981).



The test development process described in this guide reflects the need
for tests which match curriculum and instruction. The specification of
curricular and instructional intentions, in fact, is the core of the
development process. Because these intentions guide item development, the
validity and usefulness of the testing are increased.

Because instructional intentions can be defined at the class, school,
or district levels, the test development process described in the guide
can be used to create valid and useful tests at all these levels. The guide
is thus appropriate for a variety of users:

o  groups of teachers and their principals can use the guide to
   develop tests that reflect classroom/building needs;

o  district testing specialists can use the guide where there is a
   need for tests, in addition to those developed for teacher use,
   that reflect district progress;

o  groups of teachers and district staff can use the guide and work
   together to meet both kinds of needs.

Approach to Testing

Is this guide any different from other materials purporting to be of 
value in local test development efforts? We think it is, because it 
reflects a concern for fairness and utility in testing. It leads to tests 
that match curricular intentions and have relevance for instructional 
decision making. It emphasizes the need to integrate the acts of teaching 
and testing. What is taught provides the basis for what will be tested, 
and the results of the testing can then feed back to the ongoing business 
of teaching. 






The idea is not, as some have suggested, that tests ought to drive the
curriculum, nor that teachers, strictly speaking, ought to "teach to the
test." Rather, both testing and instruction ought to reflect significant,
agreed-upon curriculum goals and objectives. Tests should measure
important class, school, and district objectives, and classroom instruction
should provide students with an opportunity to attain those objectives.
The model, as displayed in Figure 1, is a simple one.

Figure 1 displays a model of instruction and testing which 
systematically uses assessment information to support and facilitate 
instructional improvement. As the figure implies, state, district, legal 
and other requirements, available tests and other instructional materials, 
and professional judgments are synthesized to arrive at goals and
objectives. These goals and objectives then serve as the guidepost for 
designing instruction and tests. Because both testing and the 
instructional program mirror the same goals and objectives, test results 
can be used to identify areas where individuals may need more help, where 
additional class instruction is needed, and where the instructional program 
(the next time around) can be strengthened and improved. Because they 
match what schools are trying to accomplish, the results also provide a 
fair and valid measure of effectiveness. 

You'll notice that Figure 1 includes an additional element, labeled 
"domain specifications." These specifications clarify the nature of the 
goals and objectives that are to be taught and provide a conceptual map 
that can guide both testing and instruction. They likewise provide a 
public and open model of exactly what is expected of all -- a clear
statement of the knowledge, content, skills, and procedures that teachers 




Figure 1*

DOMAIN REFERENCED VIEW OF INSTRUCTION AND TESTING

[Diagram not reproduced. Its labeled elements are: State and Legal
Requirements; Textbooks; Professional Judgments & Preferences; Course
Goals and Objectives; Domain Specifications; Test; Test Results;
Individual; Group/Class; Program; Instructional Implications; Individual
Strengths & Weaknesses; Remediation/Acceleration Needs; Placement
Decisions; Mastery/Non-Mastery; Program Evaluation; Other Implications;
Grading; Certification.]

* Taken from Herman, J.L., Testing and Instructional Improvement: An
Integrated Test Development Process.






intend to teach and that students are expected to learn. Domain
specifications, as described later, are the most arduous part of the test
development process. They are also the critical link which enables the
integration of testing and instruction and helps to assure that tests are
sensitive to a school's instructional program, that they are targeted on
meaningful skills, and that the entire testing process is fair and useful.

Structure of the Guide

The guide is set out as follows:

We begin with a description of domain specifications and exemplify
each of the major components included in the specification. This section
develops ongoing illustrations of each domain component and concludes by
offering a complete sample domain-referenced test specification. Others are
provided in Appendix A.

The next section of the guide offers some generally held principles of
item construction for each of the major item forms in the constructed
response and the selected response modes.

The final section offers procedures for ensuring that items written to
assess a given domain do indeed match their specifications. We provide a
scale for this purpose, along with procedures for using the scale to judge
an item's fit with each of the elements in the domain specification, for
interpreting the meaning of the final rating of fit the item receives, and
for deciding what that rating implies for modifications in the item or its
specification.

THE COMPONENTS OF DOMAIN SPECIFICATIONS 
Overview to Components 

A domain specification includes six major components as follows:

First, the domain description focuses on what's expected of the
student in a particular area.

Second, the content limits set the range of content that can be used 
to write test items. This step has an option for developing selected 
response items or constructed response items. 

A selected response item presents the student with a question or 
problem and alternative answers to the question. The student's job is to 
pick the correct answer. A constructed response item, on the other hand, 
asks the student to create an answer for a question or problem. 

Third, the distractor limits describe the wrong answers that may be
used as alternatives for selected response items. The response criteria,
which are the counterpart to distractor limits, provide the rules for
judging a student's constructed response.

Fourth, the format describes the item presentation form. 

Fifth, the directions tell the student what he or she is supposed to 
do in answering the questions. 

And sixth, a sample item, reflecting the rules in 1 to 5 above, is
provided.

Each of these components is clarified in the following sections. 
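
For readers who keep their specifications in electronic form, the six components can be sketched as a simple record. The Python sketch below is only an illustration under our own assumptions: the class name, the field names, and the missing_components helper are hypothetical and are not part of the guide. It captures the one structural rule stated above, namely that the third component is distractor limits for selected response items and response criteria for constructed response items.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainSpecification:
    """One domain specification with the six components listed above."""
    domain_description: str
    content_limits: str
    response_mode: str                       # "selected" or "constructed"
    item_format: str
    directions: str
    sample_item: str
    distractor_limits: Optional[str] = None  # for selected response items
    response_criteria: Optional[str] = None  # for constructed response items

    def missing_components(self) -> list:
        """Return the names of required components that are still empty."""
        required = ["domain_description", "content_limits",
                    "item_format", "directions", "sample_item"]
        # The third component depends on the response mode.
        if self.response_mode == "selected":
            required.append("distractor_limits")
        elif self.response_mode == "constructed":
            required.append("response_criteria")
        else:
            raise ValueError("response_mode must be 'selected' or 'constructed'")
        return [name for name in required if not getattr(self, name)]


spec = DomainSpecification(
    domain_description="Identifying shapes as triangles",
    content_limits="Four shapes, only one of which is a triangle",
    response_mode="selected",
    item_format="Multiple choice",
    directions="Mark an X on the shape that is a triangle.",
    sample_item="",  # not written yet
)
print(spec.missing_components())  # ['sample_item', 'distractor_limits']
```

A record of this kind makes it easy to see at a glance which parts of a specification still need to be drafted before item writing begins.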

Analyzing Intentions and Expectations

Domain specifications begin by considering what is to be taught and
assessed. The nucleus of a domain specification might begin by stating the
principal outcome expected of students. For example:



o  Writing a paragraph

If writing a paragraph were to be the subject of a full domain description,
what would be some of the skills we might expect of students as they write
their paragraphs? Perhaps they would be:

o  stating a main idea

o  offering supporting details

o  using complete sentences to form the paragraph

o  using correct spelling, punctuation, and grammar

Let's take a look at the nucleus of another domain specification.
Perhaps we want students to be able to

o  Identify triangles

and after instruction we would want students to be able to select as
triangles from among various geometric shapes only those which have:

o  three sides

o  straight sides

o  enclosed shape

The instructional implications of such a domain specification should
be obvious: the specification can be used to identify critical features
of both instruction and testing. For example, we have said that
the defining features of triangles are three sides, straight sides, and a
closed shape. But these features represent only a preliminary definition.
Even in this simple example, a number of instructional questions can be
raised about the kinds of discriminations, within the broad domain, that we
want students to be able to make. Do we want them to be able to identify
the generic triangle shape only? To be able to identify isosceles
triangles? Equilateral triangles? Right triangles? Some of these?
All of these? Why? Do we wish to set conditions or barriers that we want
students to be able to cope with as they go about identifying triangles,
such as triangles standing on their vertices as opposed to their bases? If
so, why?

We raise these issues to underscore, once again, the notion that
domain specifications attempt to integrate the acts of instruction and
assessment. They do this by precisely specifying instructional intentions,
expected student outcomes, and accurate measures of these outcomes. The
nature of these intentions and expectations guides the development of the
six major components of the domain specification which we alluded to
earlier, and which we'll now begin to look at more closely.

Developing the Domain Specification

The first thing to do before writing a domain specification is to set
its broad focus. First, what is the subject matter that is to be tested --
math; English mechanics; English composition? Second, what is the intended
grade level of the instruction and the assessment, and how might this level
affect the readability and intended difficulty of the items that are to be
developed on the basis of the domain specifications? Third, what kind of
items will be used -- selected response; constructed response? Fourth, at
what cognitive level do we want the students to operate in demonstrating
their knowledge during assessment -- knowledge; comprehension; application;
analysis; synthesis; evaluation? (See Appendix B for a description of each
of these levels.)

The answers to these questions help to keep the domain specification
properly focused as we begin to write up its components. If we plan to
develop a specification for selected response items, our specification will
contain: domain description; content limits; distractor limits; format;
directions; and sample item. If we plan to develop a specification for
constructed response items, our specification will differ from the above
description only by replacing distractor limits with response criteria.

In the next section, we will describe each of these domain specification
features and give examples as we go along.

Domain Description

The domain description provides a broad but operational definition of
the behavior expected of the students in a particular content area. This
description may consist of an objective or an explanation of a task and/or
its components. It may include performance conditions, although these
conditions will be specified in greater detail later in the specification.

Here are some examples of domain descriptions:

o  Math -- identifying shapes as triangles

o  English mechanics -- applying the rules of capitalization

o  English composition -- writing a legible, well-organized,
   and grammatically correct paragraph in which a position
   is taken and supported.

This description should give a specific picture of the domain of interest
to the person who will use the domain specifications as a blueprint for
developing test items.

Content Limits

Content limits describe the ballpark from which items can be written.
If the item does not fit in the ballpark, then it is not assessing what we
want it to assess. Therefore, the content limits must provide a careful
description of the range of eligible content from which test items may be
written. This description may include rules for creating questions and
rules for generating prompts, cues, or additional materials such as
pictures, graphs, and reading selections.

Note that here we are talking about the "question" or task part of the
item; that is, the stem. Other parts of the item, such as its distractors
or its scoring criteria, are specified later in the process.

The nature of the content limits will vary depending on whether we are
writing a domain specification for selected response items or for
constructed response items.

For selected response items. A selected response item asks the
student to choose an answer from a number of given alternatives, as in
true-false, matching, or multiple choice items. Content limits for selected
response items, therefore, need to define and restrict the characteristics
of the item stem and any additional material included in the presentation
of the question or problem.

Here are two examples of content limits, building on the first two
examples we began with in the domain description section, for selected
response items:

o  Math -- the student will be asked to select the triangle
   from among four shapes, only one of which is a
   triangle. Permissible shapes will include 4 or more
   sided figures which are linear; 3 sided figures in
   which one side is curvilinear; and circles. Triangles
   included in the test will reflect the following:
   equilateral, isosceles, obtuse, and acute.

o  English mechanics -- the student will be presented with a
   sentence and asked to select the word that is improperly
   capitalized. The sentence will contain at least
   four capitalized words, one of which is improperly
   capitalized. The following rules will be used in
   determining correct and incorrect capitalization.






Let's take a look at these content limits for a moment. In the math
content limit, in this example, we are making a rule that only four shapes
can be presented, and only one of them can be a triangle. This means that
an item containing three shapes, five shapes, or more than one triangle
would not meet the specified content limits. More importantly, we are also
describing the kinds of shapes that can be used in the assessment.
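
The counting rule in the math content limits lends itself to a mechanical check. The sketch below is a minimal illustration under our own assumptions: each shape in an item is represented only by a name string, and the function name check_math_item is hypothetical. It checks the two counts stated above (four shapes, one triangle) and nothing more; whether each shape is of a permissible kind would still need a reviewer's judgment.

```python
def check_math_item(shape_kinds):
    """Return a list of content-limit violations for one item.

    shape_kinds is a list of strings such as "triangle" or "circle",
    one per shape presented in the item.
    """
    problems = []
    if len(shape_kinds) != 4:
        problems.append(f"expected 4 shapes, found {len(shape_kinds)}")
    n_triangles = shape_kinds.count("triangle")
    if n_triangles != 1:
        problems.append(f"expected 1 triangle, found {n_triangles}")
    return problems


print(check_math_item(["triangle", "square", "circle", "pentagon"]))
# []
print(check_math_item(["triangle", "square", "circle"]))
# ['expected 4 shapes, found 3']
```

An empty list means the item satisfies both counting rules; any message in the list names the rule that was broken.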

In the content limits described for the English mechanics domain, we
are specifying that the sentence contain at least four capitalized words,
and only one of them can be improperly capitalized. An item with three
capitalized words would not match the content limits as described in the
example; a sentence with six capitalized words would. A sentence with more
than one improperly capitalized word would also fail to match the described
content limits. And more substantively, a capitalized word which did not
exemplify one of the specified rules would violate the specification and be
an unfair measure of instruction.
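
The same kind of check can be sketched for the English mechanics content limits. In the Python sketch below, the function name and the idea of passing in a set of correctly capitalized words are our own assumptions; the guide's capitalization rules themselves are not encoded, so the item writer supplies that set by hand. The example sentence is the one used in the guide's sample item.

```python
def check_capitalization_item(words, correctly_capitalized):
    """Check one sentence against the capitalization content limits.

    words: the words of the sentence as they appear in the item.
    correctly_capitalized: set of words that the item writer judges to
        be properly capitalized under the specified rules (hypothetical
        input; the rules themselves are not modeled here).
    """
    capitalized = [w for w in words if w[:1].isupper()]
    improper = [w for w in capitalized if w not in correctly_capitalized]
    problems = []
    if len(capitalized) < 4:
        problems.append("fewer than four capitalized words")
    if len(improper) != 1:
        problems.append(
            f"{len(improper)} improperly capitalized words, expected 1")
    return problems


sentence = "My Grandmother gave me a Timex watch for Christmas".split()
print(check_capitalization_item(sentence, {"My", "Timex", "Christmas"}))
# []
```

Here "Grandmother" is the single improperly capitalized word, and the sentence has four capitalized words in all, so the item passes both rules.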

For constructed response items. Constructed response items provide
students with a question and/or prompt and ask them to generate, rather
than select, a response. Writing an essay, supplying a short answer, or
completing an incomplete statement are typical constructed responses.
Let's take a look now at an example of a content limit for a constructed
response item in the third domain we are illustrating here -- English
composition:



o  English composition -- The student will be presented with a
   topic with which most high school students in this school
   would be familiar. This could be a topic dealing with a
   situation commonly encountered in daily living.

   The topic must embody an issue which permits the student
   to take one of two sides; i.e., in favor of or opposed to
   the proposition described.




   The prompt to the student will have three parts. One
   sentence will provide the student with brief background
   regarding the issue, with both the pro and con positions
   expressed in this sentence. This sentence will be labeled
   as Background.

   The background sentence will be followed by the Assignment
   for the student, which consists of a one-sentence task
   description directing the student to write a paragraph in
   which he/she is in favor of, or opposed to, the topic
   proposition.

   The assignment description will be followed by a short (no
   more than four sentences) paragraph giving the student
   sufficient detail to fully understand the assignment,
   expectations for the nature of the student product (e.g.,
   take a position and support it with at least two
   arguments) and the nature of the criteria that will be
   used to judge the response.
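
The three-part structure of the prompt can also be checked in a rough way. The sketch below is illustrative only: the function names are hypothetical, and the sentence counter is deliberately naive (it splits on end punctuation and would miscount text containing abbreviations such as "e.g."). It checks only the sentence counts named in the content limits above.

```python
import re


def count_sentences(text):
    """Naive count: split on ., ! or ? followed by whitespace or the end."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def check_prompt(background, assignment, details):
    """Return a list of violations of the three-part prompt rules."""
    problems = []
    if count_sentences(background) != 1:
        problems.append("background must be one sentence")
    if count_sentences(assignment) != 1:
        problems.append("assignment must be one sentence")
    if count_sentences(details) > 4:
        problems.append("details paragraph has more than four sentences")
    return problems


background = ("Some people think that letter grades should be given for "
              "high school classes, while other people believe that all "
              "classes should be graded as either pass or fail.")
assignment = ("Write a paragraph in which you are in favor of, or opposed "
              "to, a pass/fail grading system in high school.")
details = ("Stay on the assigned topic. Take a position and support it. "
           "Check your grammar, spelling, and punctuation. "
           "Write clearly on the paper provided.")
print(check_prompt(background, assignment, details))  # []
```

A check like this cannot judge whether the topic is familiar to students or whether both positions appear in the background sentence; those remain matters for the item reviewer.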



Obviously, the content limits for the English composition assignment
provide a lot of content detail about the nature of the task that is to be
presented to students -- the kinds of topics which are appropriate, the
prompting which is to be provided, and how the assignment is to be framed.
A question violating the content limits would need to be modified or
replaced to meet them.

The careful detail in the content limits, however, is necessary for
several important reasons. Each student should bring a common
understanding of the assignment to the task at hand; the specifications
should clearly dictate these understandings. Likewise, the rater(s) of the
written work should bring a common understanding of the task they are to
judge; specifications help to achieve this commonality of task
understanding. Further, such specification permits the generation of
multiple, parallel assignments, for both instruction and testing,
maximizing the integration of testing and instruction and the utility and
fairness of the instructional process.




After the first two components of the domain specification have been
carefully detailed, the next task is to describe the distractor limits for
selected response items, or the response criteria for constructed response
items.

Distractor Limits 

The distractor limits provide a description of the wrong answers or 
distractors that may be used as alternatives for selected response items. 
Based on specific categories of error types, the distractor limits define
categories of wrong answers and provide rules for generating alternative 
responses for each Item. These rules should represent common student 
errors and, where possible, should provide diagnostic information about the 
source of student error. 

Here are two examples, continuing with our ongoing math and English
mechanics illustrations, of distractor limits descriptions:

o  Math -- distractors will be drawn from a set of shapes that
   are lacking in one of the following characteristics:

      3 sides
      straightness
      closed edges

o  English mechanics -- distractors will be drawn from words
   in the sentence that are properly capitalized.
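
The math distractor limits tie each wrong answer to a missing characteristic, which is what gives a wrong choice its diagnostic value. The sketch below illustrates this under our own assumptions: the Shape record and both function names are hypothetical, and a shape counts as an eligible distractor if it lacks at least one of the three characteristics (a circle, which the content limits permit, lacks more than one).

```python
from dataclasses import dataclass


@dataclass
class Shape:
    sides: int          # number of sides
    all_straight: bool  # False if any side is curved
    closed: bool        # False if the outline has a gap


def lacking_characteristics(shape):
    """Return which defining characteristics of a triangle the shape lacks."""
    lacking = []
    if shape.sides != 3:
        lacking.append("3 sides")
    if not shape.all_straight:
        lacking.append("straightness")
    if not shape.closed:
        lacking.append("closed edges")
    return lacking


def is_eligible_distractor(shape):
    # A shape lacking nothing is a triangle and cannot be a wrong answer.
    return len(lacking_characteristics(shape)) > 0


square = Shape(sides=4, all_straight=True, closed=True)
open_triangle = Shape(sides=3, all_straight=True, closed=False)
print(lacking_characteristics(square))         # ['3 sides']
print(lacking_characteristics(open_triangle))  # ['closed edges']
```

If a student marks the open three-sided figure, the list returned for that shape points to closure as the likely source of the error.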

Response Criteria 

Unlike selected response items, constructed response items do not
include wrong answers to distract the student's choice of a correct
response. In place of distractor limits descriptions, domains for
constructed response items require a description of the rules and criteria
that will be used to judge the quality of the student's response.






There are two judgment strategies that can be used in grading
students' constructed responses -- separate criteria or holistic judgment. In
the case of students' written work, for example, using the separate
criteria approach might involve giving points according to how well the
written product satisfies each of several distinctive criteria (such as
those set forth below). On the other hand, the holistic judgment approach
relies on one overall assessment of the students' work. While it is true
that even in the holistic approach we use judgmental criteria to reach an
overall judgment, such as the extent to which a paragraph displays
acceptable organization, these criteria are applied in a comprehensive
sense rather than in the criterion-by-criterion manner characteristic of
the separate criteria approach.

Careful procedures need to be used during the rating process to assure
reliable results. Irrespective of which judgmental approach is selected,
it is imperative that those individuals who will be judging the paragraphs
engage in training/clarification sessions prior to their actual judging of
students' work. Judges should read the same student production, render
their judgments independently, then share these judgments and discuss their
reasons with other judges. Disagreements regarding the meaning of certain
criteria should be resolved. This process should be continued until a high
degree of inter-judge agreement is achieved (see Quellmalz & Burry, 1983,
for a more detailed discussion of these issues related to writing
assessment).

In addition, during the actual judging, it is desirable to have each
student's work rated independently by two judges, with a third rater being
called on to resolve disagreements.
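
The two rating procedures just described can be sketched in a few lines. Both functions below rest on assumptions of ours rather than on rules given in the guide: inter-judge agreement is measured as the simple proportion of papers on which two judges give the same score, and when the two judges disagree, the third rater's score is taken as final. The function names are hypothetical.

```python
def agreement_rate(scores_a, scores_b):
    """Proportion of papers on which two judges gave the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


def final_score(first, second, third=None):
    """Return the agreed score, or the third rater's score on disagreement."""
    if first == second:
        return first
    if third is None:
        raise ValueError("judges disagree; a third rating is needed")
    return third


print(agreement_rate([4, 3, 5, 2], [4, 3, 4, 2]))  # 0.75
print(final_score(4, 4))     # 4
print(final_score(5, 4, 4))  # 4
```

What counts as a "high degree" of agreement, and how a third rating should be combined with the first two, are local decisions that a rating team would need to settle before scoring begins.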

Continuing with our ongoing English composition illustration, here are
sample response criteria descriptions.




Organization 

1. The student has written about the assigned topic. 

2. The paragraph Includes a topic sentence which embodies a 
position regarding the assigned topic. 

3. All other sentences 1n the paragraph support the topic 
sentence. 

Mechanics 

1. The paragraph 1s written legibly. 

2. Complete sentences are used (rather than fragment or 
run-on sentences) . 

3. Words are spelled correctly. 

4. Punctuation is correct with regard to use of commas,
capitalization, etc.



Now, if criteria such as the above were to be used as the basis for
judging students' written essays, it would be a good idea to develop a
scale with, say, one to five points, so that those judging the composition
could then assign a point score to indicate the extent to which the
criteria were satisfied. In addition, before using criteria such as those
suggested above, it would also be a good idea to have judges make sure that
they agree on what each criterion actually means: for example, that they
all agree on definitions of "position," "support," and even on such mundane
matters as "punctuation" since, in some instances such as the use of serial
commas, correctness is as much a matter of convention as it is of
hard-and-fast rules.
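
A separate-criteria score sheet built on a one-to-five scale might be tallied as in the sketch below. This is an illustration under our own assumptions: the short criterion labels are paraphrases of the sample criteria above, the function name is hypothetical, and adding the ratings into category and overall totals is one possible way of combining them, not a procedure prescribed by the guide.

```python
CRITERIA = {
    "Organization": ["on assigned topic",
                     "topic sentence states a position",
                     "other sentences support the topic sentence"],
    "Mechanics": ["legible", "complete sentences",
                  "spelling", "punctuation"],
}


def score_paragraph(ratings):
    """Total one judge's 1-5 ratings by category and overall."""
    totals = {}
    for category, names in CRITERIA.items():
        for name in names:
            value = ratings[name]  # KeyError if a criterion is unrated
            if not 1 <= value <= 5:
                raise ValueError(f"{name}: rating {value} is outside 1-5")
        totals[category] = sum(ratings[name] for name in names)
    totals["Total"] = sum(totals.values())
    return totals


ratings = {
    "on assigned topic": 5,
    "topic sentence states a position": 4,
    "other sentences support the topic sentence": 3,
    "legible": 5,
    "complete sentences": 4,
    "spelling": 3,
    "punctuation": 4,
}
print(score_paragraph(ratings))
# {'Organization': 12, 'Mechanics': 16, 'Total': 28}
```

Keeping the category totals separate preserves the diagnostic value of the separate criteria approach, since a low Organization total and a low Mechanics total call for different instruction.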

To this point, then, the evolving domain specification describes the
domain, its content limits and, depending on the response mode in which the
student will answer, either the distractor limits or the response
criteria. The remaining three elements of the domain specification consist
of format, directions, and sample test item.






Generating rules governing item format, directions to the student, and
sample item is usually easier than generating rules for the first three
parts of the specification. However, the clarity of item format and
directions can have important consequences and therefore deserves careful
attention. Further, parts of the specification already written may need to
be modified to accommodate problems in setting up and formatting the sample
item.

Format

The format section of the test specifications provides a careful
description of the form in which items can be presented. Again, here are
some example format descriptions for the three domains we are illustrating:



o  Math -- multiple choice: four shapes will be presented as
   response alternatives, only one of which is a proper
   triangle.

o  English mechanics -- multiple choice: a sentence will be
   presented with four words or word groups from the
   sentence as response alternatives, one of which is
   incorrectly capitalized or left uncapitalized.

o  English composition -- constructed: the student will be
   presented, aurally and in writing, with a three-part
   expository prose prompt; lined notebook paper will be
   provided for essay responses.



Directions 

This section of the test specifications provides the actual set of
directions to be used or rules for generating directions to the student for
completing the test item. For example:






o  Math -- Look at the four shapes below. Only one of them is
   a triangle. Mark an X on the shape that is a triangle.

o  English mechanics -- Read the sentence below, and then
   circle the letter next to the word that is improperly
   capitalized.

o  English composition -- The paragraph you write should stay
   on the assigned topic. In your paragraph, be sure to
   take a position regarding the issue and support the
   position you have taken. Make sure your paragraph is
   well organized and has appropriate grammar, spelling, and
   punctuation. Write clearly on the paper provided.



Sample Item 

The sample item follows the rules set up in the preceding five parts
of the domain specifications, and stands as a guide that test developers
can follow as they develop items. Here are three examples from the domains
we are using for illustration:



o  Math -- Look at the four shapes below. Only one of them is
   a triangle. Mark an X on the shape that is a triangle.

   [Shapes shown in the original are not reproduced here.]

o  English mechanics -- Read the sentence below, and then
   circle the letter next to the word that is improperly
   capitalized.

   My Grandmother gave me a Timex watch for Christmas.

   A. My
   B. Grandmother
   C. Timex
   D. Christmas






o  English composition

   Background: Some people think that there should be
   letter grades given for high school classes, while other
   people believe that all classes should be graded as
   either pass or fail.

   Assignment: Write a paragraph in which you are in favor
   of, or opposed to, a pass/fail grading system in high
   school.

   The paragraph you write should stay on the assigned
   topic. In your paragraph, be sure to take a position
   regarding the issue and support the position you have
   taken. Make sure your paragraph is well organized and
   has appropriate grammar, spelling, and punctuation.
   Write clearly on the paper provided.



Summary 

So far, then, we have examined and illustrated the six major features
of a domain specification: the domain description provides a general
statement of what's expected of the student in a particular content area;
the content limits set the range of eligible content for writing test
items; the distractor limits describe the wrong answers that can be used
for selected response items, and the response criteria establish rules for
judging a student's constructed response; the format describes the item
presentation form; the directions tell the student what he or she is to do
in answering the question; and the sample item, following all of the above,
provides a visual aid for item writers to rely upon as they write
additional items for the domain.
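
For developers who keep their specifications in electronic form, the six components summarized above can be held in a simple record. The following is only a sketch under our own assumptions: the class name, field names, and the completeness check are hypothetical and are not part of the procedures in this guide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DomainSpecification:
    """A domain specification with the six components summarized above."""

    domain_description: str
    content_limits: str
    item_format: str
    directions: str
    sample_items: List[str] = field(default_factory=list)
    # Selected response domains use distractor limits; constructed
    # response domains use response criteria. At least one is expected.
    distractor_limits: Optional[str] = None
    response_criteria: Optional[str] = None

    def missing_components(self) -> List[str]:
        """Return the names of components that are still empty."""
        missing = []
        for name in ("domain_description", "content_limits",
                     "item_format", "directions"):
            if not getattr(self, name).strip():
                missing.append(name)
        if not (self.distractor_limits or self.response_criteria):
            missing.append("distractor_limits_or_response_criteria")
        if not self.sample_items:
            missing.append("sample_items")
        return missing


spec = DomainSpecification(
    domain_description="Using correct capitalization in paragraphs.",
    content_limits="Paragraph of at least six sentences, no capitals.",
    item_format="Multiple choice, four possible responses.",
    directions="",  # deliberately left blank to show the check
    sample_items=["In the first sentence, the following words ..."],
    distractor_limits="Omit a required word, or list a word that "
                      "should not be capitalized.",
)
print(spec.missing_components())  # ['directions']
```

A check of this kind only confirms that each component has been filled in; it says nothing about whether the component is well written.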

To show what a fully developed domain specification might look like,
we have provided an example below for 5th-grade language arts (other
samples are in Appendix A).






Sample Domain Specification

Grade Level:   Grade 5

Subject:       Language Arts

Domain         Using correct capitalization in paragraphs
Description:   adapted from a standard fifth grade text of
               practical/informative nature.

Content        The student will be presented with a paragraph of
Limits:        at least six sentences in which all the capital
               letters have been omitted. Reading level should
               be fifth grade or lower. The test questions will
               consist of identifying the words in a given
               sentence of the paragraph which must be
               capitalized. These words may include: the first word
               of a sentence; the names of languages, people,
               schools; days of the week; months of the year;
               places and buildings; titles of books or movies.

               Each question will consist of correctly
               identifying all the words in one sentence, listed by
               number, which need to be capitalized to make the
               sentence correct.

Distractor     The alternate responses to the questions may
Limits:        include: a) omission of one word(s) within the
               given sentence which should be capitalized; or
               b) listing of a word or words in the given
               sentence which should not be capitalized.

Format:        Each sentence of the paragraph will be numbered.
               Each question will be multiple choice, with four
               possible responses.

Directions:    The directions will be given: "Choose the letter
               which contains all the necessary capitalized
               words in the given sentence to make the sentence
               correct."

Sample         1. of all my high school friends, i remember jim
Item:          the best. 2. he had a way of making adventures
               out of everyday events. 3. one Sunday i remember
               particularly; it was a beautiful day in may. 4.
               i looked out the window, watching the sunlight
               dance on the Columbia river. 5. my mom
               interrupted my daydreams--reminding me about my
               homework for my german class. 6. i started
               flipping through my history book, the american
               republic, to avoid beginning the german grammar.
               7. suddenly a hissing voice outside the window
               attracted my attention. 8. it was jim; he was
               ready for his favorite activity, fishing. 9. we
               sneaked down the back stairs and out the back
               door.

               1. In the first sentence, the following words should
                  be capitalized:

                  a. Of, I, Jim   [correct answer]

                  b. High School

                  c. Of

                  d. Of, I



These specifications, then, are the blueprint that item writers follow
as they develop test items. As we will see later, the care that goes into
developing domain specifications is matched by the care with which the
developed items are judged to determine the extent to which they match the
intentions of the domain specification. But there is one additional
consideration to keep in mind while the items are being developed: the
technical properties of a good item, independent of its governing domain
specification. We'll take up this topic in the next section.

ITEM CONSTRUCTION RULES

When the domain specifications have been written, reviewed for substance
and clarity, and checked to make sure they work as intended (e.g., the
writer of the specifications can try to develop a few additional sample
items, or ask a colleague to make this check to ensure there are no bugs in
the specifications), item writing begins.






Items are of course written to match the specifications. But there is
another consideration as well. In constructing items, there are some
generally agreed upon rules, or perhaps rules-of-thumb, that help make sure
that items do not contain flaws that unnecessarily cue or confuse the
student. That is, having good domain specifications does not guarantee
that the items generated from them will necessarily be good items. An item
can have a perfect fit with its guiding specifications and yet be flawed.

In this section, then, we'll offer some rules for writing constructed
response items (essay and short answer or completion items) and for
writing selected response items (true-false, matching, and multiple
choice items).

Constructed Responses

Essay items: An essay item asks the student to produce a piece of
written work ranging in length from one to several paragraphs. In writing
essay type items, we need to keep the following rules in mind:

1. The task expected of the student should be defined as completely
as possible, without interfering with the measurement of the
domain being tested.

2. The topic to be written on should represent a novel situation or
problem, and not be a repetition of situations or problems used
for instructional purposes. If the test question merely repeats
something that has happened in class, then all it requires from
the examinee is recall, which is more efficiently measured by
another test format, such as multiple choice.






3. To obtain adequate reliability, the student needs to have a clear
picture of what constitutes an acceptable response. It is also
necessary to have a detailed scoring guide. Reliability is also
increased by having each student answer several questions, or by
having several independent scorers per answer.

4. If students are allowed to select among several questions, each
question should be of the same difficulty level.

Short answer and completion items: Each of these item forms is
answered by a word, phrase, number, or other symbol that is written by the
student. The two forms are essentially the same, and differ only in how
the problem is presented to the student. The short answer item asks a
direct question of the student, while a completion item consists of an
incomplete statement to be completed by the student. Here are some rules
for this item genre:

1. The question itself should not provide any extraneous clues to the
answer.

2. The question must be stated so that only one brief answer is
possible.

3. No grammatical clues should be given, such as "a" or "an."

4. The student should receive clear directions stating the degree of
precision expected and/or the unit(s) in which the answer is to be
expressed.

5. The scoring or answer key should anticipate possible synonyms or
acceptable variants of the desired response.




6. Only key words should be left blank; it is generally better to
have the blank (to be filled in) at the end of the statement
because the student will then have in mind all information needed
to give an answer (this tactic may also simplify the scoring); the
blanks should be uniform in length throughout the test; where they
are positioned should generally be the same.

7. Statements in which the blank would complete an instructional
cliche should be avoided.
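
Rule 5 above, an answer key that anticipates acceptable variants, can be illustrated with a short scoring sketch. The function names and the sample key are our own hypothetical choices, and the normalization shown (ignoring case and extra spaces) is an assumption about what a local scoring guide might allow.

```python
from typing import Iterable


def normalize(response: str) -> str:
    """Lowercase the response and collapse runs of whitespace."""
    return " ".join(response.lower().split())


def score_short_answer(response: str, acceptable: Iterable[str]) -> int:
    """Return 1 if the response matches any acceptable variant, else 0."""
    key = {normalize(variant) for variant in acceptable}
    return 1 if normalize(response) in key else 0


# Hypothetical key for "How many days are in a week?"
week_key = ["7", "seven", "7 days", "seven days"]
print(score_short_answer("  Seven  days ", week_key))  # 1
print(score_short_answer("a week", week_key))          # 0
```

Listing the variants in advance also serves rule 4: if the directions state the unit expected, the key needs fewer entries.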

Selected Responses 

True-false: The true-false item consists of a statement that the
student will mark true or false, right or wrong, correct or incorrect, yes
or no, fact or opinion, agree or disagree, and so forth. In each case,
there are only two possible answers. Here are some rules for true-false
items:

1. The item must be free from ambiguity. It must be unequivocally
true or false. It is difficult to develop an item to have this
property while at the same time assuring that it remains unclear
to the novice student and unambiguous to the knowledgeable.

2. The question must embody only one idea.

3. The question should be stated in positive form whenever possible.
It must never contain a double negative and, if it is stated in
the negative form, the negative word should be clearly marked.

4. The question should be worded so that the student with only
superficial knowledge would be led to the wrong answer.






5. The question can be worded so that the incorrect answer is
consistent with a popular misconception or belief. It can also
use phrases in false statements to give them a "ring of truth."
These devices, however, must be handled carefully, since the
intention should not be to trick the student.

6. The question should not depend for its truth or falsity on an
insignificant word or phrase.

7. The question should not include indefinite terms, degrees, or
amounts.

8. The question should not include specific determiners. True items
should not be qualified; false items should not be absolute.

9. The questions should be evenly divided between true and false to
help avoid biases due to guessing.

10. Material emphasized in the question should not be based on an
instructional cliche.

Other considerations: Since there are only two response options open
to the student, there is a 50-50 chance of any guess being
correct. True-false items are also open to criticism that the
ability to recognize an incorrect statement is not necessarily
dependent on knowledge of the correct answer. In general, any
question that can be presented in a true-false format can usually
be presented more effectively in a multiple choice format.
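
To make the 50-50 guessing point concrete, the sketch below computes the chance that a student who guesses blindly on every item reaches a given number correct. It assumes independent guesses with the same success chance on each item (a binomial model); the function name and the 10-item example are ours, offered for illustration only.

```python
from math import comb


def prob_at_least(n_items: int, n_correct: int, p_guess: float = 0.5) -> float:
    """Chance of at least n_correct right answers from blind guessing."""
    return sum(
        comb(n_items, k) * p_guess**k * (1 - p_guess) ** (n_items - k)
        for k in range(n_correct, n_items + 1)
    )


# Ten items, seven or more correct by guessing alone.
true_false = prob_at_least(10, 7, p_guess=0.5)    # two options per item
four_choice = prob_at_least(10, 7, p_guess=0.25)  # four options per item
print(round(true_false, 4))  # 0.1719
print(four_choice < true_false)  # True
```

Under these assumptions, roughly one blind guesser in six reaches 7 of 10 on a true-false test, which is one reason the multiple choice format is often preferred.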

Matching items: A matching question consists of two columns with each
word, number, or symbol in one column matched to a word, sentence, or
phrase in the other. The student's job is to identify the pairs of items
on the basis indicated. Here are some rules for this item form:






1. Students should be provided with clear directions explaining the
basis of matching.

2. The entire item should appear on the same page.

3. Components should be short in phrasing and few in number.

4. The two columns should have appropriate labels.

5. Each alternative must be a plausible solution to all problems
presented.

6. The lists should have an unequal number of components to be
matched.

7. Components in the response column should be placed in some logical
order.

Other considerations: It is often a good tactic to inform the student
that each of the possible answers in the response column may be
used more than once, just once, or not at all. This tactic, in
conjunction with the provision of more responses than items to be
matched, minimizes the role of guessing. For example, with a
one-to-one ratio, if a student knows all but two of the correct
matches, he/she must choose where to place the two remaining
responses. If she/he guesses correctly with either, then she/he
guesses correctly with both.

The alternatives should all be of the same class of response; e.g.,
historical events, cities, etc. The student should not be able to
discard any alternative because it is illogical, does not fit with
the elements of the other column, and so forth.
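
The effect of adding extra responses can be checked by simple enumeration. The sketch below is a simplified model of our own: it assumes the student guesses blindly on the items still unknown and uses each remaining response at most once. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def prob_all_remaining_correct(n_unknown: int, n_extra: int = 0) -> Fraction:
    """Chance that blind guessing matches every unknown item correctly.

    Item i is correctly matched by response i; responses numbered
    n_unknown and above are extra responses that match nothing.
    """
    n_responses = n_unknown + n_extra
    assignments = list(permutations(range(n_responses), n_unknown))
    hits = sum(
        1 for choice in assignments
        if all(choice[i] == i for i in range(n_unknown))
    )
    return Fraction(hits, len(assignments))


print(prob_all_remaining_correct(2))             # 1/2
print(prob_all_remaining_correct(2, n_extra=2))  # 1/12
```

With a one-to-one ratio, the student who is down to two items has an even chance of getting both right; two extra responses cut that chance to one in twelve under this model.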







Multiple choice items: A multiple choice item presents the student
with a problem and a list of suggested solutions. The student is typically
requested to read the stem and to select the one correct, or best,
alternative. The following rules apply to multiple choice items:

1. The stem should contain a complete statement of the problem to be
solved.

2. The stem should be stated in clear, precise language.

3. The alternatives should be presented in a logical order: e.g., by
chronology, number series, etc.

4. In a vocabulary item, the term should be in the stem, and the
definition among the alternatives.

5. Either the stem should be stated in positive form, or the negative
word should appear at the end of the stem and be clearly marked
for emphasis.

6. All alternatives should be consistent with the grammatical and
syntactical construction of the stem.

7. All alternatives should be approximately equal in length.

8. The alternatives should make reference to the item stem and not to
the correct answer.

9. All alternatives should be equally attractive or plausible to the
uninformed examinee.

10. The item should not have fewer than four alternatives (including
the correct response).

11. The position of the correct answer should be evenly divided among
the response options (e.g., in a 25 item test with alternatives A,
B, C, and D for each, the correct response should be evenly
distributed across all four letters, in some sort of random
ordering).

12. The correct answer should not be matched by an opposite distractor
unless another pair of opposites is included among the other
distractors.

13. The correct answer should not contain a repetition of a word or
phrase found in the stem.

14. Items using pictures as stimuli should not inadvertently provide a
clue to the correct answer.

15. The stem should not take a great deal of student reading time. It
should contain material common to all alternatives so as to
decrease reading time.

16. The correct answer should not contain an instructional cliche.

Other considerations: Items should be independent of each other; the
correct answer to one question should not be necessary to obtain
the correct answer to another question. Similarly, the information
in the stem of one item should not become an aid in detecting
the answer in another item.

"All of the above" should generally be avoided since this alternative
increases the probability of the student guessing correctly by
using partial information. "None of the above" should also be
used with caution, since recognizing incorrect answers does not
ensure that the student actually knows the correct answer.

Verbal cues to the correct answer should be avoided.






Stem and alternative language level must be appropriate to the
student and to the question. For example, because of the prose
used in the stem, or because of tortured construction, it may be
that the item actually becomes a question assessing linguistic
ability or inferential reasoning rather than assessing the
intended domain.

Each item must have the same number of alternatives, no fewer than
four.

The student should be able to derive the problem from the stem, and
should not need to read the alternatives in order to discover what
question is being asked.
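
Rule 11 in the list above (spreading the correct answer evenly across the response positions, in random order) is easy to apply mechanically. The sketch below is one possible way to do it; the function name and the use of a seed are our own assumptions, not a prescribed procedure.

```python
import random
from collections import Counter


def balanced_answer_key(n_items, options=("A", "B", "C", "D"), seed=None):
    """Return a shuffled list of correct-answer positions, one per item.

    Each option is used as evenly as possible; when n_items is not a
    multiple of the number of options, randomly chosen options get one
    extra use each.
    """
    rng = random.Random(seed)
    base, extra = divmod(n_items, len(options))
    lucky = set(rng.sample(list(options), extra))
    key = []
    for option in options:
        key.extend([option] * (base + (1 if option in lucky else 0)))
    rng.shuffle(key)
    return key


key = balanced_answer_key(25, seed=1)
print(sorted(Counter(key).values()))  # [6, 6, 6, 7]
```

For the 25 item example in rule 11, three letters are correct six times each and one letter seven times. Item writers would then arrange the alternatives of each item so that the correct response falls in the assigned position, provided this does not conflict with rule 3 on logical ordering.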

Although the rules we have offered above are generally accepted among
test developers, some are more a matter of taste or convention than
others. At any rate, as with other aspects of good test development, these
rules should be provided to the people who will have the job of developing
test items. They can then discuss any possible areas of disagreement and
modify the rules if such modification will not jeopardize technical
quality. The key notion here is that all item writers understand, accept,
and apply a uniform set of rules so that when they follow the domain
specifications to write their items the items will also have acceptable
technical quality.

Once the items have been developed, the next job is to judge the
degree to which they reflect the intentions of the domain specifications
and are technically adequate. We have developed an item rating scale to
help with this task.




THE ITEM RATING SCALE 

Background 

As we have seen, domain-referenced testing limits and defines a class
of behaviors, skills, or information and provides a set of specifications
which are used to generate test items reflecting the instructional
process. These specifications permit the integration of testing and
instruction and increase the usefulness of testing. The validity of the
process depends on the match between the domain, instruction, and
assessment. A critical consideration, therefore, is the extent to which the
items match their specification.

Even with the most careful specification and item development process,
it is likely that items will vary in the degree to which they fit or belong
in the domain which they are intended to assess. Most commonly the
judgment about how well an item matches or measures the domain is not a clear
yes/no choice but rather should reflect the complexities of test
specification and item development. The Item Rating Scale (IRS) we offer,
therefore, provides a range of values that are used in judging the
"belongingness" of an item to a domain. It permits judgments to be made about
item compatibility with each of the categories in test specifications.
Further, these judgments suggest areas in which an item, or its governing
specifications, may need to be modified and improved.

Using the IRS

Raters use the IRS to judge the match between the test specifications
and any given item along eight independent dimensions. The first six
rating categories of the IRS parallel the basic structure of
domain-referenced test specifications we discussed earlier. In addition, test
item features of linguistic and thinking complexity are also included in
the IRS. Just as we suggested for the item construction rules and other
specified criteria, raters will need to become familiar with the IRS before
they use it to judge items.

The first category of the IRS concerns the item's fit with the general
domain description. The second category, content limits, compares the
description of eligible subject matter and item features with the test
item's contents and features. The third category judges the item against
distractor limits or response criteria, depending on whether the item is a
selected or constructed response type. For selected response items, the
specification rules for creating wrong answer alternatives are compared
with the actual wrong answer choices used in the test item. For
constructed response items, the prescribed criteria for evaluating the response
generated by the student are compared both to those criteria used and to
the suitability of the item and conditions for eliciting a judgeable
response. Format and directions are the fourth and fifth categories
compared between specifications and actual items. In these categories the
concern is whether the layout of the item and the directions for completing
the test conform to the test specifications. The sample item provided in
each specification is the final aspect of the test specifications included
in the Item Rating Scale.

Linguistic complexity and thinking complexity provide a structure for
getting an accurate picture of some of the more subtle sources of
complexity that may affect students' performance in a way not described or
desired by the test specifications. These biasing elements are important
to the degree that the specifications and resulting items are intended to
provide the same measure of performance for all students in the given
area.




Raters assign a whole number value that best represents their judgment
of the match between the item and its specification on the particular
dimension being considered. After arriving at the rating for the item on
the first dimension, i.e., domain description, raters proceed, one
dimension at a time, rating the item-specification match on each
dimension.

When all eight categories have been individually scored, an overall
rating is then calculated for the item. The final calculations are guided
by a weighting system that incorporates the scores in each category.

The ratings are then interpreted in terms of the three features judged
to be most critical: content limits, distractor limits or response
criteria, and thinking complexity. These interpretations carry implications for
item revision or, where necessary, specification revision.

Let's take a closer look at this process, now.

The scale we developed for this process ranges from 0 to 10, with 0
indicating a poor match between item and specification, and 10 indicating a
perfect match. The scale provides the following guidelines to raters for
assigning number ratings in each component of the domain specification:

0,1,2   This rating range should be used for items that are completely
        unrelated to the specification on the dimension you are rating.

3,4,5   This rating range should be used for items that are vaguely
        related and/or inadequate.

6,7     This rating range should be used for items you feel would
        definitely require a second look and some revision, but which
        you feel reluctant to totally abandon.

8,9     This rating range should be used for items that you feel are
        good representative match-ups with the specifications, although
        slightly off.

10      This rating should be used for items that are beyond a doubt
        perfect examples of the specification.






(The ratings representing each descriptor are a function of the multiple 
criteria which are used to assess each domain dimension.) 
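
Where ratings are recorded electronically, the five rating ranges can be attached to each whole-number rating automatically. The sketch below is illustrative only; the function name and the shortened descriptor wording are our own paraphrases of the guidelines.

```python
def rating_band(rating: int) -> str:
    """Return a short descriptor for a whole-number IRS rating (0 to 10)."""
    if isinstance(rating, bool) or not isinstance(rating, int):
        raise ValueError("ratings must be whole numbers")
    if not 0 <= rating <= 10:
        raise ValueError("ratings must fall between 0 and 10")
    if rating <= 2:
        return "completely unrelated to the specification"
    if rating <= 5:
        return "vaguely related and/or inadequate"
    if rating <= 7:
        return "needs a second look and some revision"
    if rating <= 9:
        return "good match, although slightly off"
    return "perfect example of the specification"


print(rating_band(7))   # needs a second look and some revision
print(rating_band(10))  # perfect example of the specification
```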

Using the scale and its suggested point-assignment guidelines, item
raters should refer to the following indicators of item match with each of
the eight features on which items are to be judged. Depending on the
degree of match, raters then assign the item a number from 0 to 10 for each
of the domain specification components.

1. Domain Description

1. The test item is a good and fair representative of
the subject area outlined in the domain description
of the test specifications. It does not assess an
obscure or unusual aspect of the domain.

2. Test item conditions are not at odds with test
intentions. This is especially important in
constructed items.

3. The test item content is closely related to the
instructional objective(s) stated or implied in the
domain description.

2. Content Limits -- Selected Response Items Only

1. The item and additional accompanying material (e.g.,
graphs, maps, reading selections) follow the content
limits on length and general difficulty level.

2. The item and additional accompanying material follow
the content limits on eligible content, descriptive
detail, and completeness of information provided.

3. The solution processes required by the student to
answer the item match those described or implied in
the content limits.

2. Content Limits -- Constructed Response Items Only

1. The item matches the content limits on eligible
content, descriptive detail, or completeness of the
prompting information provided.

2. The item provides a context for responding that is
similar to that described in the content limits
(e.g., time restrictions, length of written/oral
response, equipment or aid restrictions, warmup or
false start provisions).






3. The mental processes required by the student to
respond to the item seem to match those described or
implied in the content limits.

3. Distractor Limits -- Selected Response Items Only

1. The alternative answers, or distractors, provided in
the item require the student to discriminate
important features or factors described in the distractor
limits as differentiating correct from incorrect
answers. Distinctions between correct and incorrect
answers are not based on trivial or irrelevant
features.

2. The distractors provided in the item correspond to
the content limits on number, length, and general
level of difficulty.

3. Response Criteria -- Constructed Response Items Only

1. The rules used to judge the student's response are
those described by the response criteria.

2. The item prompt provides a context for responding
that is appropriate to the response criteria for
judging the content and style/form of the response
(i.e., likely to elicit a judgeable response).

3. Problems arising from incomplete or inadequate
answers are dealt with in a way that upholds the
testing intentions of the specifications.

4. Format

1. The organization and display (layout) of the item
conform to the format description in the test
specifications.

2. For selected response items only: The organization
and display of any additional information (e.g.,
maps, graphs, pictures, reading selections) conforms
to the format description.

3. For constructed response items only: The context or
conditions for responding to the item (e.g., time
limits, space limits, available equipment) conform
to the format description.

5. Directions

1. The directions for completing the test item
correspond to the description of test directions in
the test specifications.

2. The reading level and complexity of the directions
follow the description of test directions in the
test specifications, or seem to be within suitable
range for the intended students.

6. Sample Item

1. The sample item and the test item being rated could
come from the same set of items described by the
test specifications.

2. The sample item and the test item are very similar
in content and either distractors or response
criteria.

3. The sample item and the test item are very similar
in format and directions.

7. Linguistic Complexity

1. Vocabulary used in the item is consistent with the
test specifications for item difficulty. Words are
not used that have different or unfamiliar meanings
for different students or student groups.

2. Item language structure (including, e.g., the use
of compound, complex sentences, antecedents) is
consistent with the test specifications for item
difficulty.

8. Thinking Complexity

1. Those mental processes required for the solution or
performance of the test item, but that are not
described in the domain description or content
limits (i.e., are assumed), are readily available to
all students at some necessary level of competence
(e.g., drawing ability, handwriting legibility,
short-term memory capacity, imagination, ability to
separate relevant from irrelevant, detail from
generalization).

2. Directions for completing the test item provide the
same amount of information and structure for all
students. Everyone has the same understanding of
what is expected and of what the limits or rules for
answering are.




3. For items with nonverbal components, it is reasonable
to assume that these components conform with
the content limits or distractor limits in their
intended meaning, and that this interpretation is
stable across all groups of test takers.

Each item is rated on each of the above eight components, and the separate
ratings it received are recorded. It is also helpful if raters provide
written comments explaining why a particular rating was assigned. This
kind of information is extremely useful in situations where feedback is
needed for revising items or specifications.

Overall Item Rating 

We suggest that an item's overall rating be interpreted primarily, but
not exclusively, against the three most critical features of the domain
specifications -- content limits, distractor limits or response criteria,
and thinking complexity. We suggest this not only because these three
features provide the substantive core of any test item, but also because
problems in these areas are more difficult to correct than problems in
other components of the domain specification.

The overall item rating process reflects the importance of these three
critical elements. The original ratings the item received on the three
critical features are first weighted, or multiplied, by a factor of three,
and then summed with the ratings from the other specification dimensions.

In this way, an item can receive a final weighted score ranging from 0
to 140, reflecting a highest possible rating of 10 each on five of the
domain components, and a highest possible rating of 30 (3 x 10) on each of
the three critical elements.

The total score an item receives through this process is then divided
by 14 (i.e., the total number of points on the weighted scale -- five of
the elements are unweighted and each has a value of 1; three of the
components are weighted and each has a value of 3) and the quotient,
ranging from 0 to 10, is the item's overall rating.
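
The weighting arithmetic just described can be written out as a short calculation. The sketch below follows the stated rule (three critical features multiplied by 3, five other features by 1, total divided by 14); the dictionary keys and function name are hypothetical labels of our own, and the ratings in the example are invented for illustration.

```python
CRITICAL = (
    "content_limits",
    "distractor_limits_or_response_criteria",
    "thinking_complexity",
)
OTHER = (
    "domain_description",
    "format",
    "directions",
    "sample_item",
    "linguistic_complexity",
)


def overall_rating(ratings: dict) -> float:
    """Weight the three critical ratings by 3, add the rest, divide by 14."""
    for name in CRITICAL + OTHER:
        if not 0 <= ratings[name] <= 10:
            raise ValueError(f"{name} must be rated from 0 to 10")
    weighted_total = (
        sum(3 * ratings[name] for name in CRITICAL)
        + sum(ratings[name] for name in OTHER)
    )  # ranges from 0 to 140
    return weighted_total / 14


example = {
    "domain_description": 10,
    "content_limits": 8,
    "distractor_limits_or_response_criteria": 9,
    "format": 6,
    "directions": 9,
    "sample_item": 8,
    "linguistic_complexity": 7,
    "thinking_complexity": 7,
}
# Weighted total: 3 * (8 + 9 + 7) + (10 + 6 + 9 + 8 + 7) = 72 + 40 = 112
print(overall_rating(example))  # 8.0
```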

After or during the rating process, items will also need to be
reviewed for their adherence to the technical rules of item construction
listed earlier. Where problems occur, items will need to be revised
accordingly, if the item is judged to be an adequate match to its
specification. Should recurring technical flaws of the same type be found,
revisions or additions to the domain specifications may be necessary.

Let's take a moment now to consider the issue of item-rating
interpretation.

Interpreting an Item's Overall Rating 

Because we are concerned about an item's validity and the accuracy
with which it assesses student knowledge in the given domain, we suggest
that high quality standards be applied to the interpretation process. That
is, with a possible final rating of 0 to 10, we suggest that the first
criterion to consider is whether or not an item receives an overall rating
of at least 7 or 8 points.






After this determination has been made, we can then take a much closer
look at items with an overall rating of 7 or better and those with an
overall rating below 7 points. The point of this process is to ascertain how
much the critical, weighted features contribute to an item's overall
rating, and to base our decisions accordingly about keeping an item as is,
modifying an item, rejecting an item, or revising the original domain
specifications.

We offer the following suggestions to guide the interpretation
process.

Items Rated 7 or Better:

(1) If all three of the weighted, critical features are rated 8 or
better (here and elsewhere this statement means before rating weights were
applied), then the item is good and basically in conformity with the test
specification. Any further item review and/or rewrite process should be
directed toward other domain features on which the item received a lower
rating.

(2) If any one of the critical features received a rating of 7 or
lower, it will be necessary to return to the original item specifications
guiding that feature, so as to better align the item with the described
testing intentions. An item in this category likely has problems in other,
non-weighted features. It will probably be necessary to rewrite the item
and examine the revision to make sure that all critical features are still
up to par.

(3) If more than one critical feature received a rating of 7 or lower,
the item has serious validity problems. If it is decided that this item
exemplifies the kind of test item that is actually desired, then it will be
necessary to reconsider its guiding specifications. They may need to be
better conceptualized, reconceptualized, or become more complete in their
descriptions of desired item qualities. If, however, the specifications,
as written, are close to the actual testing intentions, then the item
should be discarded from the pool and replaced with a new one.

The statements provided previously 1n the IRS rating categories can be 
used as a guide in Item review and revision. 

Items Rated Below 7:

(1) If all three critical, weighted features are rated 8 or better,
the item is potentially a good one but it has serious problems in
presentation. It will again be necessary to return to the original
specifications guiding those features receiving low ratings, and to revise
the item's manner of presentation.

(2) If one or more of the critical features scored 7 or lower, the
item probably is not worth any attempted fix-up effort. However, before
starting the item writing process again, it would be a good idea to
reconsider the guiding specifications; they may need to be better
conceptualized or provide fuller descriptions of the desired item features.

Again, the IRS statements describing item match can be used in item
review and revision.
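The suggestions above can be read as a small decision rule. The sketch below is one possible encoding in Python, not part of the original guide: the function name `interpret_item`, its arguments, and the wording of the returned advice are illustrative assumptions. It takes the overall item rating and the three critical-feature ratings as they stood before rating weights were applied.

```python
from typing import Sequence


def interpret_item(overall: float, critical: Sequence[int]) -> str:
    """Suggest an action for one item (hypothetical helper, not from the guide).

    overall  -- the overall item rating on the 0-10 scale
    critical -- unweighted ratings of the three critical features
    """
    if len(critical) != 3:
        raise ValueError("expected ratings for exactly three critical features")

    # Count the critical features rated 7 or lower (before weighting).
    low = sum(1 for rating in critical if rating <= 7)

    if overall >= 7:
        if low == 0:
            return "keep: revise only lower-rated non-critical features"
        if low == 1:
            return "rewrite: realign the item with the low critical feature, then re-rate"
        return "validity problem: revise the specifications or replace the item"

    if low == 0:
        return "revise presentation: fix the features that received low ratings"
    return "discard: reconsider the specifications before writing a new item"
```

For instance, `interpret_item(7.9, [8, 9, 8])` falls in the first category, while `interpret_item(6.1, [8, 6, 9])` falls in the last. The numeric cut-offs are taken from the text; everything else is a convenience for illustration.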

USING THE GUIDE 

Appendix C contains copies of the materials that a district or school
will need if they wish to apply the item review process we have described
here. The appendix provides the directions to raters, an individual rating
worksheet, an overall item rating form, and the guide to interpreting
overall ratings.

Should a district or school develop domain specifications following
the procedures in this guide, and then have item writers develop items to
measure these domains, the materials in Appendix C can be reproduced to
guide the local item review process. Raters would then be given a copy of
the domain specifications, the items written for them, and the materials we
offer to facilitate the rating process.

To guide the item development process, the district or school should
also familiarize item writers with the construction rules we offered
earlier in the guide. To guide the item rating process, the district or
school should provide raters with a copy of the indicators of item-domain
match that also appeared earlier in the guide.

For districts or schools that wish to implement the procedures in the
guide, we strongly suggest that a local staff member (or outside expert) be
designated as facilitator. That person should become thoroughly familiar
with the contents of the guide, and take responsibility for leading
discussion/orientation sessions on domain specifications, item writing, and
item rating and interpretation with the staff members who will be
responsible for carrying out these activities.

In addition, the local facilitator should take responsibility for
adapting the materials as appropriate to the local setting, and then
oversee the entire process of domain specification, item development, and
item review.






CONCLUSION 

In this guide we have provided a means of developing domain
specifications which create a direct link between instruction and
assessment. These specifications establish rules which can be applied to
develop items and tests which represent the domain of instructional concern
and provide accurate assessment of student learning in that domain.

While this match between an item and its specification is of primary
importance, it is also important to assure that test items are free from
technical flaws. To make this more likely, we have outlined some
generally accepted rules of item construction.

In addition, no matter how clear the domain specification or how
skilled the item writer, some items will provide a better fit with a given
domain than others. The item review scale we have described offers a way
to systematically judge an item's fit and to obtain information suggesting
areas in which an item, or its domain specification, needs to be improved.

We admit here that, initially at least, such a painstaking approach to
test planning, development, and review requires some time. We can point
out, however, that people who have attended the workshops we have conducted
on the topics in this guide report that after practice the process tends to
become internalized and developing tests in such careful fashion becomes
routine.

Finally, given the increasing need to develop tests which are
appropriate to local needs and sensitive to local instructional practice
and intentions, we believe that the initial investment of time is well
worth the effort.






REFERENCES 



Burry, J., Dorr-Bremme, D.W., Herman, J.L., Lazar-Morrison, C.M., Lehman,
    J.D., & Yen, J.P. Teaching and testing: Allies or adversaries? CSE
    Report No. 165. Los Angeles: UCLA Center for the Study of Evaluation,
    1981.

Burry, J., Catterall, J., Choppin, B., & Dorr-Bremme, D.W. Testing in the
    nation's schools and districts: How much? What kinds? To what
    ends? At what costs? CSE Report No. 194. Los Angeles: UCLA Center
    for the Study of Evaluation, 1982.

Dorr-Bremme, D.W. Assessing students: Teacher's routine practices and
    reasoning. Evaluation Comment, 1983, 6(4), 1-12.

O'Shea, D.W. School district evaluation units: Problems and
    possibilities. In A. Bank & R.C. Williams (Eds.), Evaluation in
    school districts: Organizational perspectives. CSE Monograph No.
    [illegible]. Los Angeles: UCLA Center for the Study of Evaluation,
    1981.

Quellmalz, E.S., & Burry, J. Analytic scales for assessing students'
    expository and narrative writing. CSE Resource Paper No. [illegible].
    Los Angeles: UCLA Center for the Study of Evaluation, 1983.






APPENDIX A 



SAMPLE DOMAIN-REFERENCED 
TEST SPECIFICATIONS 






Grade Level: Grade 9

Subject: English - punctuation

Domain Description:

    Correctly punctuating given paragraphs adapted from a standard eighth
    grade text of a practical/informative nature.

Content Limits:

    The student will be presented with one paragraph in which all the
    correct punctuation marks have been omitted, except for apostrophes
    in contractions (I'll) and possessives (Jane's), dashes, and
    semi-colons.

    For each question, students will be asked to choose all the correct
    punctuation marks which must be added in a given sentence to make the
    sentence correct. The punctuation marks to be identified and added
    may include:

    a. periods at the end of a declarative or imperative sentence, after
       an abbreviation, or after an initial
    b. question marks following an interrogative sentence
    c. exclamation point after exclamatory sentences or interjections
    d. colon after the salutation in a business letter, or to separate
       minutes and hours in expressions of time, and to show that a
       series of things or events follows
    e. quotation marks enclosing a quotation or a fragment of it,
       enclosing the title of a story or poem which is a part of a larger
       book
    f. comma in a date or address; to set off such words as "yes" at the
       beginning of a sentence; to set off names of persons or words
       (phrases) in apposition; to separate words in a series, direct
       quotations, parallel adjectives, parenthetical phrases; after
       introductory prepositional phrases; before coordinate
       conjunctions; after the salutation and closing in a friendly
       letter; to separate a dependent clause from an independent clause
       in a complex sentence.

Distractor Limits:

    The alternate responses to the questions may include:

    a. omission of punctuation mark(s) within a given sentence which
       should be included, or
    b. inclusion of a punctuation mark or marks which is not necessary or
       correct in the given sentence






Directions:

    The directions will be given: "Choose the letter which contains all
    the necessary punctuation marks in the given sentence which will make
    the sentence correct."

Format:

    Each question will be multiple choice, with four possible responses.

Sample Item:

    1. If she starts to sing again I'll crack up 2. It is funny how it
    hurts to hold back a laugh 3. I was sitting in the auditorium at
    10:00 am and we were having a singing rehearsal for graduation
    4. Sit up Get off those shoulders Think tall Sing tall Sing like this
    said Ms Small 5. I knew that if she was going to tweet like a bird
    again I would laugh 6. But I just could not laugh because Ms Small
    would kick me out of the auditorium and that meant Felson's
    office--and no graduation 7. La la la--sing children Sing with your
    hearts said Ms Small 8. I couldn't hold it 9. She was so funny I
    almost rolled off the auditorium seat 10. The other students didn't
    laugh but I sounded like Santa Claus 11. It became quiet for a second
    12. What are you doing Joe I know it is you Present yourself to
    Mr Felson at once that voice said 13. Ms Small is a foot shorter than
    a tall Coke but she has the bark of a hungry hound dog

    1. The first sentence should be written:

         a. If she starts to sing again I'll crack up.
         b. If she, starts to sing again, I'll crack up
       ✓ c. If she starts to sing again, I'll crack up.
         d. If she starts, to sing again, I'll crack up.
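For districts that keep their specifications in electronic form, the components of a domain specification can be held in a simple record. The following Python sketch is only an illustration under that assumption: the class name `DomainSpecification` and its field names are hypothetical, and the field values are abbreviated from the Grade 9 punctuation specification above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DomainSpecification:
    """Hypothetical record holding the components of one domain specification."""
    grade_level: str
    subject: str
    domain_description: str
    content_limits: List[str]
    distractor_limits: List[str]   # or response criteria, for constructed responses
    item_format: str
    directions: str
    sample_items: List[str] = field(default_factory=list)


# Values abbreviated from the Grade 9 punctuation specification.
punctuation_spec = DomainSpecification(
    grade_level="Grade 9",
    subject="English - punctuation",
    domain_description=(
        "Correctly punctuating given paragraphs adapted from a standard "
        "eighth grade text of a practical/informative nature."
    ),
    content_limits=[
        "One paragraph with punctuation omitted, except apostrophes, "
        "dashes, and semi-colons.",
        "Marks to add may include periods, question marks, exclamation "
        "points, colons, quotation marks, and commas.",
    ],
    distractor_limits=[
        "Omission of punctuation mark(s) which should be included.",
        "Inclusion of a mark or marks not necessary or correct.",
    ],
    item_format="Multiple choice, with four possible responses.",
    directions=(
        "Choose the letter which contains all the necessary punctuation "
        "marks in the given sentence which will make the sentence correct."
    ),
    sample_items=["1. The first sentence should be written: ..."],
)
```

Keeping every component in one record makes it straightforward to hand raters the same specification that item writers worked from.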






Grade Level: Grade 8

Subject: Introduction to Algebra

Domain Description:

    Using basic operations and laws governing open sentences, solve
    equations with one unknown quantity.

Content Limits:

    1. Stimuli include a number sentence with one unknown quantity,
       represented by a lower case letter in italics, and an array of
       four solution sets or single answers, only one of which is
       correct.

    2. Number sentences may be statements of equalities or inequalities.

    3. The number sentences may require simplifying before solving by
       combining like terms or carrying out operations indicated (e.g.,
       by parentheses).

    4. Number sentences will have no more than five terms. Fractions may
       be used but not decimal fractions and non-decimal fractions in the
       same expression. Exponents (powers) may appear in the expression
       only if they cancel out and need not be solved or modified.

    5. Solution sets for equations and inequalities will be drawn from
       the set of rational numbers (±). The null set (∅) may be used as a
       correct solution set.

    6. Factoring may be a requisite operation for solving the equation.

    7. Application of the distributive property of multiplication and the
       use of reciprocal values may be requisite operations for solving
       the equation.

Distractor Limits:

    1. Distractors may be drawn from the set of wrong answers resulting
       from errors involving any one of the following operations:

       a. combining terms
       b. transformations that produce equivalent equations (e.g.,
          transferring terms using the principle of reciprocal values)
       c. distributing multiplication, with positive or negative numbers
          (e.g., across parentheses)
       d. carrying out basic operations using brackets or parentheses






    2. Distractors may also be drawn from the set of wrong answers due to
       incomplete solution sets.

    3. Distractors may not reflect errors due to wild guessing,
       calculations involving negative numbers, or errors in basic
       operations.

    4. "None of the above" is not an acceptable alternative.

Format:

    Multiple choice; five alternatives.

Directions:

    Solve the equation. Then select the correct answer or solution set
    from the choices given.

Sample Item:

    1. 8n + 2 = 2n + 38; n = ?

         a) n = 3
       ✓ b) n = 6
         c) n = 4
         d) n = 5
         e) n = 7.6

    2. 16x ≤ 32; x = ?

         a) x = 48
       ✓ b) x = {0,1,2}
         c) x = 2
         d) x = 0
         e) x = {3,4,5...}
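Because the content limits require that only one alternative be correct, an item writer may want to confirm the keyed answer before the item goes to raters. The short Python sketch below substitutes each alternative of the first sample item into the equation 8n + 2 = 2n + 38. The function and variable names are illustrative assumptions, and exact fractions are used so that the decimal alternative 7.6 is compared without rounding error.

```python
from fractions import Fraction


def satisfies(n: Fraction) -> bool:
    """Return True when n solves 8n + 2 = 2n + 38."""
    return 8 * n + 2 == 2 * n + 38


# Alternatives a) through e) of the first sample item; 7.6 is written as 38/5.
alternatives = {
    "a": Fraction(3),
    "b": Fraction(6),
    "c": Fraction(4),
    "d": Fraction(5),
    "e": Fraction(38, 5),
}

# Labels of every alternative that actually solves the equation.
correct = [label for label, value in alternatives.items() if satisfies(value)]
```

A list with exactly one label, matching the keyed alternative, indicates that the item has a single correct answer as the specification requires.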






Grade Level: Secondary

Subject: Life science - circulatory system

Domain Description:

    Applying understanding of the circulatory system to predict
    cause-effect relationships within the system.

Content Limits:

    1. Circulatory systems include: pulmonary circulation, coronary
       circulation, systemic circulation (renal and portal).

       Heart structures eligible for identification and differentiation
       of function include: left and right atria (or auricles), left and
       right ventricles, pulmonary artery and veins, systemic artery and
       veins, aorta, valves.

       Other structures eligible: veins, arteries, capillaries, femoral
       artery and vein, inferior vena cava and superior vena cava,
       jugular vein and carotid artery, brachial artery and basilic vein,
       portal and renal veins and arteries.

       Eligible cause-effect situations include: heart attack,
       arteriosclerosis, injury to aorta or other major veins and
       arteries (superior, inferior vena cava, jugular, carotid, femoral
       veins/arteries, portal and renal veins and arteries, brachial and
       basilic), high blood pressure, pulse, heart murmur.

    2. Items on cause-effect may present the cause and ask the effect or
       vice versa. These items may be presented pictorially, e.g.,
       showing a blood clot in the coronary artery. However, in these
       cases, all parts must be labelled for the student.

Response Criteria:

    1. For labelling pictures, terms must be correct; spelling does not
       count. Partial credit may be given for correct labels in pictures
       requiring more than one response; incorrect labelling that affects
       meaning (e.g., not including the word artery or vein in carotid)
       should be counted as incorrect.






    2. Correct responses to cause-effect must include all the underlined
       points below in order to be considered a complete and correct
       answer. Partial credit may be awarded at the discretion of the
       teacher.

       a. heart attack: clot in coronary artery preventing the flow of
          blood to the heart; heart tissue damaged or destroyed due to
          lack of food and oxygen since blood can't reach cells.

       b. injury to major veins and arteries: should differentiate the
          functions and locations of the given vein or artery (femoral
          artery and vein; inferior and superior vena cava; jugular vein
          and carotid artery; brachial artery and basilic vein; portal
          and renal veins and arteries; aorta).

       c. arteriosclerosis: loss of elasticity of artery walls which
          normally stretch and relax with the pulsing during heartbeat.
          Lost elasticity, often due to fatty deposits on the artery
          walls (hardening of the arteries), can create abnormally high
          blood pressure as the blood is pushed through narrower ducts.

       d. high blood pressure: could describe two possible causes:
          exercise (heart pumps harder to supply more oxygen to the
          muscles), and changes to the blood vessels, e.g.,
          arteriosclerosis (smaller tube way for blood flow increases
          pressure).

       e. pulse and heartbeat: should describe the pumping action of the
          heart as reflected in the arteries, stretching the arterial
          walls, pulse as accurate indicator of heart action.

       f. heart murmur: must describe valve functions normally and their
          sound (ventricles contract and valves close; ventricles relax
          and aorta valves open); murmur represents backflow of blood
          from incomplete or improper valve closing.






Format:

    Fill-in; label figures; or paragraph responses.

Directions:

    Complete each sentence. OR Label each part of the diagram
    representing ______. OR Diagram (or describe) the ______ process
    through the heart. OR Answer each question completely, including a
    description of causes, effects, and other processes involved.

Sample Item:

    Answer completely, including a description of parts or functions
    where necessary.

    What would be the effect of injury to the carotid artery?






APPENDIX B 



ITEM DIFFICULTY LEVELS 






AN ANNOTATED COGNITIVE DOMAIN TAXONOMY* 

This classification describes, from simplest to most complex, six degrees
to which information that is taught can be learned.

1.  Knowledge. Recalling information pretty much as it was learned.
    In its simplest manifestation, this includes knowledge of the
    terminology and specific facts (dates, people, etc.) associated with
    an area of subject matter. At a more complex level it means knowing
    the major sub-areas, methods of inquiry, classifications, and ways of
    thinking characteristic of the subject area, as well as its central
    theories and principles.

2.  Comprehension. Reporting information in a way other than how it was
    learned in order to show that it has been understood.
    Most basically this means reporting something learned through an
    alternative medium. More complex evidence of comprehension involves
    interpreting information in "one's own words" or in some other
    original way, or extrapolating from it to new but related ideas and
    implications.

3.  Application. Use of learned information to solve a problem.
    This means carrying over knowledge of facts or methods learned in one
    specific context to completely new ones.

4.  Analysis. Taking learned information apart.
    Analysis means figuring out a subject matter's most elemental ideas
    and their interrelationships.

5.  Synthesis. Creating something new and good, based on some criterion.
    This creation can be something that communicates to an audience, that
    plans a successful goal-directed endeavor, or that subsumes a
    collection of ideas within a new theory.

6.  Evaluation. Judging the value of something for a particular purpose.
    This means making a statement of something's worth based either on
    one's own well-developed criteria or on the well-understood criteria
    of another.



* Adapted from TAXONOMY OF EDUCATIONAL OBJECTIVES: The Classification of
  Educational Goals: HANDBOOK I: Cognitive Domain, by Benjamin S. Bloom,
  et al. Copyright 1956 by Longman Inc. Previously published by David
  McKay Company, Inc. By permission of Longman Inc.






APPENDIX C 



MATERIALS USED IN 
ITEM RATING PROCESS 






DIRECTIONS 

The Item Rating Scale (IRS) is intended for use in making systematic
content validity judgments for domain-referenced tests by comparing test
specifications with items. The scale is devised in such a way as to provide
feedback, as well, for revising items or specifications as necessary. In
using the IRS, one test item at a time is rated against a set of test
specifications.

1.  Get a copy of the test specifications and the items you wish to rate.

2.  Go through the categories of the IRS using the statements in each
    section to direct you in judging the compatibility of your item with
    the six test specification features and the two additional categories
    concerned with complexity issues.

3.  In each section, rate the extent to which your item appears to be a
    member of the hypothetical set of items described by the test
    specifications in that category. Use a scale of 0 to 10 to rate your
    item, letting 0 indicate a poor match and 10 a perfect one.

    The following guidelines are suggested for assigning number ratings in
    each section:

    0, 1, 2   This rating range should be used for items that are
              completely unrelated to the specification in the dimension
              you are rating.

    3, 4, 5   This rating range should be used for items that are vaguely
              related and/or inadequate.

    6, 7      This rating range should be used for items you feel would
              definitely require a second look and some revision, but
              which you feel reluctant to totally abandon.

    8, 9      This rating range should be used for items that you feel are
              good representative match-ups with the specifications
              although slightly off.

    10        This rating should be used for items that are beyond a doubt
              perfect examples of the specifications.

    Enter your rating in the box provided.

    Space for taking notes has been provided. It is strongly suggested
    that you take advantage of this to make comments about the item as you
    rate it. Such notes will be useful later in revising the item or the
    specifications.

4.  Complete the Overall Item Rating sheet by carrying over the rating
    scores from each section to the appropriate line of the rating sheet.
    Make the calculations indicated in the directions there, applying the
    rating weights where indicated.

5.  Refer to the Interpretation Guide for rating explanations.

6.  REMEMBER YOU ARE RATING THE MATCH BETWEEN THE ITEM AND THE
    SPECIFICATION, NOT THE ITEM AND YOUR EXPECTATIONS OR STANDARDS! ALSO,
    EACH IRS CATEGORY SHOULD BE RATED INDEPENDENTLY OF THE OTHERS; FOR
    EXAMPLE, DOMAIN DESCRIPTION RATINGS DO NOT INCLUDE CONTENT LIMIT
    CONSIDERATIONS. USE THE STATEMENTS PROVIDED TO GUIDE YOUR JUDGMENTS.
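Where ratings are recorded electronically, the number guidelines in step 3 can be attached to each rating as a reminder of what the number is meant to convey. The Python sketch below is a hypothetical helper, not part of the IRS materials: the name `rating_band` is an assumption, and the returned phrases are condensed from the guidelines above.

```python
def rating_band(rating: int) -> str:
    """Return the guideline description for a single 0-10 IRS rating."""
    if not isinstance(rating, int) or not 0 <= rating <= 10:
        raise ValueError("rating must be a whole number from 0 to 10")
    if rating <= 2:
        return "completely unrelated to the specification"
    if rating <= 5:
        return "vaguely related and/or inadequate"
    if rating <= 7:
        return "needs a second look and some revision"
    if rating <= 9:
        return "good representative match, though slightly off"
    return "perfect example of the specification"
```

For example, `rating_band(7)` returns the description for the 6-7 range, which is also the highest range that the Interpretation Guide treats as a low rating on a critical feature.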




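SPECIFICATION BEING RATED ______________________________

RATER ____________________   TITLE ____________________

COMMENTS: (additional comments can be made on the reverse side)

ITEM NUMBER ______

     Domain Description                         ______

    *Content Limits                             ______

    *Distractor Limits or Response Criteria     ______

     Format                                     ______

     Directions                                 ______

     Sample Item                                ______

     Linguistic Simplicity                      ______

    *Thinking Complexity                        ______

     TOTAL                                      ______

     TOTAL ÷ 14 =                               ______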






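SPECIFICATION BEING RATED ______________________________

RATER ____________________   TITLE ____________________

COMMENTS: (additional comments can be made on the reverse side)

ITEM NUMBER ______

     Domain Description                         ______

    *Content Limits                             ______

    *Distractor Limits or Response Criteria     ______

     Format                                     ______

     Directions                                 ______

     Sample Item                                ______

     Linguistic Simplicity                      ______

    *Thinking Complexity                        ______

     TOTAL                                      ______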






OVERALL ITEM RATING 



1.  Recopy item ratings from each section, making the indicated
    weighting adjustments for the starred features: Content Limits,
    Distractor Limits or Response Criteria, and Thinking Complexity.

     DOMAIN DESCRIPTION                                        ______

    *CONTENT LIMITS                           (____ x 3) =     ______

    *DISTRACTOR DOMAIN OR RESPONSE CRITERIA   (____ x 3) =     ______

     FORMAT                                                    ______

     DIRECTIONS                                                ______

     SAMPLE ITEM                                               ______

     LINGUISTIC COMPLEXITY                                     ______

    *THINKING COMPLEXITY                      (____ x 3) =     ______

     TOTAL                                                     ______

2.  Total the scores. Divide the total by 14. This number is the
    overall item rating.

     OVERALL ITEM RATING: ______ ÷ 14 = ______

3.  Item's technical adequacy:

     ______ Acceptable

     ______ Modifications needed

4.  Refer to the Interpretation Guide for assistance in making
    decisions about the item and for suggestions on changing the
    item or its specifications.
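The divisor of 14 on this form is the sum of the weights: five unstarred features counted once and three starred features counted three times (5 + 9 = 14), so the overall rating stays on the same 0 to 10 scale as the individual ratings. The Python sketch below works through that arithmetic. The dictionary keys and the function name `overall_item_rating` are illustrative assumptions, and the example ratings are invented for demonstration only.

```python
# Weight of each feature on the Overall Item Rating form (starred features x 3).
WEIGHTS = {
    "domain_description": 1,
    "content_limits": 3,
    "distractor_limits_or_response_criteria": 3,
    "format": 1,
    "directions": 1,
    "sample_item": 1,
    "linguistic_complexity": 1,
    "thinking_complexity": 3,
}


def overall_item_rating(ratings: dict) -> float:
    """Apply the weights, total the scores, and divide the total by 14."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= ratings[name] <= 10:
            raise ValueError(f"rating for {name} must be between 0 and 10")
    total = sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())  # the weights sum to 14


# Invented ratings, for illustration only.
example = {
    "domain_description": 9,
    "content_limits": 8,
    "distractor_limits_or_response_criteria": 9,
    "format": 6,
    "directions": 7,
    "sample_item": 8,
    "linguistic_complexity": 5,
    "thinking_complexity": 8,
}
example_rating = overall_item_rating(example)  # weighted total of 110, divided by 14
```

In this invented case the weighted total is 9 + 24 + 27 + 6 + 7 + 8 + 5 + 24 = 110, giving an overall rating a little under 8 with all three starred features rated 8 or better.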




INTERPRETATION GUIDE



ITEMS RATED 7 OR BETTER

IF ALL THREE STARRED CRITICAL FEATURES ARE RATED 8 OR BETTER*, your item
is good, basically in conformity with the test specifications. Review and
rewrite efforts should be directed toward other features that scored low,
e.g., Format. Use the statements in the IRS rating categories to guide
your work.

IF ONE CRITICAL FEATURE RECEIVED A RATING OF 7 OR LOWER*, go back to the
specifications on that feature. Try to better align your item with the
testing intentions described in the specifications. Use the statements in
the IRS to help direct your thinking. You may also have problems with
other features. Rewrite the item but review it again to be certain all
critical features are up to par.

IF MORE THAN ONE CRITICAL FEATURE RECEIVED A RATING OF 7 OR LOWER*, the
item has serious validity problems. If this is the kind of test item you
want, then you should reconsider the specifications you are using. They
may need to be better conceptualized, reconceptualized, or more complete
in their description of item qualities. If the specifications are closer
to what you want to be testing, throw out the item. Find or write a new
item.



ITEMS RATED BELOW 7

IF ALL THREE STARRED CRITICAL FEATURES ARE RATED 8 OR BETTER*, your item
is potentially a good one but has serious problems in presentation. Go
back to the specifications for those features receiving the low ratings.
Clean up your item. Use the statements in the IRS rating categories to
guide your efforts.

IF ONE OR MORE OF THE CRITICAL FEATURES SCORED 7 OR LOWER*, your item
isn't worth the fix-up effort. Before you start over, reconsider the
specifications with which you are working; they may need to be better
conceptualized or more complete in their description of item features.



* Before rating weights are applied.






